Read the data

First, we load the dataset and the required libraries

# Load libraries
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(car)
library(plotly)

# Read the data
df <- as_tibble(read.csv("gapminder_clean.csv", row.names = 1))

Walkthrough analysis

Filter the data to include only the year 1962

df_1962 <- df %>%
  dplyr::filter(Year==1962)

Plot the correlation
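
The chunk that produced the figure did not survive in this export; a minimal sketch of such a scatter plot (using the column names already present in the dataset) could be:

```r
# Scatter plot of GDP per capita vs CO2 emissions for 1962
ggplot(df_1962, aes(x = gdpPercap, y = CO2.emissions..metric.tons.per.capita.)) +
  geom_point() +
  labs(x = "GDP per capita", y = "CO2 emissions (metric tons per capita)")
```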

The plot above suggests a correlation between the selected variables. Therefore, we compute Pearson's r for them

cor.test(df_1962$gdpPercap, df_1962$CO2.emissions..metric.tons.per.capita., method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  df_1962$gdpPercap and df_1962$CO2.emissions..metric.tons.per.capita.
## t = 25.269, df = 106, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8934697 0.9489792
## sample estimates:
##       cor 
## 0.9260817

The calculation suggests a strong positive correlation between gdpPercap and CO2 emissions, with r = 0.9260817 and p-value < 2.2e-16.

Find the correlation across all years

# Get all the years (time points) from the dataframe
years <- unique(df$Year)
# Calculate the correlation for each year and store it (with the year) in a list
res <- lapply(years, function(x){
  filt_df <- df %>%
    filter(Year==x) %>%
    transmute(percap= gdpPercap, co2=CO2.emissions..metric.tons.per.capita.) %>%
    drop_na()
  corr <- cor.test(filt_df$percap, filt_df$co2, method = "pearson")
  list(x,as.numeric(corr$estimate))
})
# Extract the correlation coefficients into a vector
corrs <- sapply(res, function(x) x[[2]])
# Find the year that holds the maximum Pearson's r
max_year <- years[which.max(corrs)]

Therefore, we have the maximum year stored in the variable max_year (= 1967). Next, we construct an interactive plot, using this year as a filter.
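
The plotting chunk itself is not included in this export; a minimal plotly sketch (assuming the same column names as above, with Country.Name used for hover text) might look like:

```r
# Interactive scatter plot for the year with the strongest correlation
df_max <- df %>% filter(Year == max_year)
plot_ly(df_max,
        x = ~gdpPercap,
        y = ~CO2.emissions..metric.tons.per.capita.,
        type = "scatter", mode = "markers",
        text = ~Country.Name)
```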

Exploratory data analysis without an instruction

Relationship between continent and Energy use variables

First things first: we should plot the data to get to know it
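
The plotting chunk is missing here; a boxplot along these lines would serve (rows with an empty continent field are dropped, as in the filtering step below):

```r
# Boxplot of energy use per continent
df %>%
  filter(continent != "") %>%
  ggplot(aes(x = continent, y = Energy.use..kg.of.oil.equivalent.per.capita.)) +
  geom_boxplot() +
  labs(x = "Continent", y = "Energy use (kg of oil equivalent per capita)")
```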

Stats

The plot suggests that we have one categorical variable and one quantitative variable. The standard statistical test for this kind of problem is ANOVA. However, there are several assumptions we need to verify before applying the test:

1. Observations are obtained independently and at random
2. The data at each factor level are normally distributed
3. These populations have a common variance

The first assumption is covered: we assume that these observations were obtained independently and at random. For the rest, we should run some tests

Let’s filter the data for the tests

df_energy <- df %>%
  transmute(continent=continent, energy=Energy.use..kg.of.oil.equivalent.per.capita.) %>%
  drop_na() %>%
  filter(continent != "")

Check the variance homogeneity

To test this assumption, we use Levene's test from the car package, because it is less sensitive to departures from normality than the analogous Bartlett's test

leveneTest(energy ~ continent, data=df_energy)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   4  12.424 8.003e-10 ***
##       843                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis in Levene's test is that the groups have a common variance. We reject it with a p-value of about 8e-10. Therefore, we cannot assume that the variance is common between the populations

Check if the populations have a normal distribution

We should also check whether the data are normally distributed within each continent. For that we use the Shapiro-Wilk test.

continents <-  unique(df_energy %>% select(continent))$continent

res <- lapply(continents, function(x){
  df_tmp <- df_energy %>%
    filter(continent==x)
  list(shapiro.test(df_tmp$energy), x)
  })

for (i in res){
  print(paste(i[[2]], "have W: ", i[[1]]$statistic, " with p-value ", i[[1]]$p.value))
}
## [1] "Europe have W:  0.896467349504789  with p-value  2.99948739634846e-12"
## [1] "Africa have W:  0.674723638575491  with p-value  2.3343308284197e-19"
## [1] "Americas have W:  0.563222686679644  with p-value  1.5868536782942e-21"
## [1] "Oceania have W:  0.981809862485087  with p-value  0.955266147969386"
## [1] "Asia have W:  0.660991122763317  with p-value  4.94311060721094e-19"

The data above suggest that the normality assumption is violated. However, the Shapiro-Wilk test tends to flag even the slightest deviation from normality when the sample is large (>50 observations). Therefore, it is always better to double-check with QQ-plots
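
The QQ-plot chunk is not shown in this export; one way to draw the plots, using ggqqplot() from the already-loaded ggpubr package:

```r
# QQ-plots of energy use, one panel per continent
ggqqplot(df_energy, x = "energy", facet.by = "continent")
```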

The QQ-plots align with the results of the Shapiro-Wilk test: only the data for Oceania follow a normal distribution. Just to be sure, let's print the summary statistics for each continent

group_by(df_energy, continent) %>%
  summarise(
    count = n(),
    mean = mean(energy, na.rm = TRUE),
    median=median(energy, na.rm = TRUE),
    sd = sd(energy, na.rm = TRUE)
  )
## # A tibble: 5 x 5
##   continent count  mean median    sd
## * <chr>     <int> <dbl>  <dbl> <dbl>
## 1 Africa      199  699.   450.  627.
## 2 Americas    188 1704.   749. 2377.
## 3 Asia        185 1867.   760. 2590.
## 4 Europe      256 3146.  3028. 1734.
## 5 Oceania      20 3980.  4045. 1123.

The small sample size partly explains the normality of the Oceania data. Given that energy use grows over time, we expect a non-normal distribution across the 10 time points.

So, we should use non-parametric tests instead. The non-parametric analogue of ANOVA is the Kruskal-Wallis test

kruskal.test(energy ~ continent, data = df_energy)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  energy by continent
## Kruskal-Wallis chi-squared = 318.68, df = 4, p-value < 2.2e-16

The Kruskal-Wallis test tells us that there is a statistically significant difference in medians among the populations. However, we haven't compared them pairwise. Because the normality assumption is violated for 4 out of 5 groups, we use pairwise comparisons with the Wilcoxon rank sum test.

pairwise.wilcox.test(df_energy$energy, df_energy$continent,
                 p.adjust.method = "BH")
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  df_energy$energy and df_energy$continent 
## 
##          Africa  Americas Asia    Europe 
## Americas 1.9e-14 -        -       -      
## Asia     0.00016 0.09746  -       -      
## Europe   < 2e-16 < 2e-16  < 2e-16 -      
## Oceania  1.1e-12 9.4e-08  3.5e-07 0.00424
## 
## P value adjustment method: BH

The results above suggest that there is a significant difference in energy use over the 10 time points for the 5 continents. The only non-significant difference, with a p-value of 0.09746, is between Asia and the Americas.

Let's draw the final plot with all the stats included
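
The chunk for that figure is absent from this export; a sketch using ggboxplot() and stat_compare_means() from ggpubr:

```r
# Boxplot of energy use per continent, annotated with the Kruskal-Wallis p-value
ggboxplot(df_energy, x = "continent", y = "energy") +
  stat_compare_means(method = "kruskal.test")
```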

Imports of goods and services analysis

The point of this analysis is to check whether there is a significant difference in imports between Europe and Asia in the years after 1990. Let's filter the data first

df_1990 <- df %>%
  filter(Year>1990) %>%
  transmute(continent=continent,import=Imports.of.goods.and.services....of.GDP. ) %>%
  filter(continent=="Asia" | continent=="Europe") %>%
  drop_na()

Let's create a simple boxplot to get a sense of the data
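
The boxplot chunk is not shown; a minimal sketch using the filtered data frame from above:

```r
# Boxplot of imports (% of GDP) for Asia and Europe after 1990
ggplot(df_1990, aes(x = continent, y = import)) +
  geom_boxplot() +
  labs(x = "Continent", y = "Imports of goods and services (% of GDP)")
```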

We have two groups, and we check the difference between them. We cannot use a one-sample t-test, because the groups are independent and there is no reference value to compare either of them against. Therefore, the unpaired two-samples t-test is a better option. But first we need to meet the assumptions of this parametric test:

1. Normal distribution within groups
2. Equal variance across the groups

Let's perform the normality check using the Shapiro-Wilk test and a QQ-plot

continents <-  unique(df_1990 %>% select(continent))$continent

res <- lapply(continents, function(x){
  df_tmp <- df_1990 %>%
    filter(continent==x)
  list(shapiro.test(df_tmp$import), x)
  })

for (i in res){
  print(paste(i[[2]], "have W: ", i[[1]]$statistic, " with p-value ", i[[1]]$p.value))
}
## [1] "Asia have W:  0.854896735520448  with p-value  2.30969161420933e-08"
## [1] "Europe have W:  0.929048540991224  with p-value  1.36333544206785e-05"
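
The QQ-plot itself is not included in this export; as before, it could be drawn with ggqqplot() from ggpubr:

```r
# QQ-plots of imports, one panel per group
ggqqplot(df_1990, x = "import", facet.by = "continent")
```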

And let’s explore the basic stats for each group

group_by(df_1990, continent) %>%
  summarise(
    count = n(),
    mean = mean(import, na.rm = TRUE),
    median=median(import, na.rm = TRUE),
    sd = sd(import, na.rm = TRUE)
  )
## # A tibble: 2 x 5
##   continent count  mean median    sd
## * <chr>     <int> <dbl>  <dbl> <dbl>
## 1 Asia         98  46.8   39.8  33.5
## 2 Europe      114  41.8   37.8  16.7

We can see that the normality assumption is violated: the Shapiro-Wilk test gave p < 0.05 for each group. Given the sensitivity of this test, we also drew the QQ-plot, which, unfortunately, confirms the violation of this assumption.

Then let's check if the variances are equal between groups

var.test(import ~ continent, data = df_1990)
## 
##  F test to compare two variances
## 
## data:  import by continent
## F = 4.0189, num df = 97, denom df = 113, p-value = 3.696e-12
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  2.740396 5.927773
## sample estimates:
## ratio of variances 
##           4.018893

As we can see, the homogeneity-of-variances assumption is also violated (p-value << 0.05)

So we should use the non-parametric alternative, which is the unpaired two-samples Wilcoxon test.

wilcox.test(import~continent,data=df_1990, alternative = "two.sided")
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  import by continent
## W = 5707, p-value = 0.7867
## alternative hypothesis: true location shift is not equal to 0

Well, the difference between the two groups is not statistically significant (p-value = 0.7867).

And the final visualization:
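
The chunk for it is missing; a sketch in the same style as before, with the Wilcoxon p-value displayed via ggpubr:

```r
# Boxplot annotated with the Wilcoxon test p-value
ggboxplot(df_1990, x = "continent", y = "import") +
  stat_compare_means(method = "wilcox.test")
```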

Highest population density across countries

Let's transform the initial data to keep only population density, country, and year

df_pop <- df%>%
  transmute(year=Year, country=Country.Name, pop_den=Population.density..people.per.sq..km.of.land.area.) %>%
  drop_na()

Let's take a look at the data with an interactive plot.
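
The interactive chunk is not shown in this export; a plotly sketch, with one trace per country:

```r
# Interactive line plot of population density over time
plot_ly(df_pop,
        x = ~year, y = ~pop_den,
        color = ~country,
        type = "scatter", mode = "lines")
```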

We can tell right away that the highest population density across all years belongs to either Macao SAR, China or Monaco. However, we cannot tell which one is the winner across all years.

# Get unique countries and years as vectors
countries <- unique(df_pop$country)
years <- unique(df_pop$year)
# Make an empty dataframe of countries, which we will populate with ranks
country_rank <- data.frame(country = countries)
# Iterate over years, rank the population density, add the ranks to the dataframe
for (i in years){
  df_tmp <- df_pop %>%
    filter(year == i)
  data_ranked <- data.frame(country = df_tmp$country,
                            rank = rank(df_tmp$pop_den))
  country_rank <- cbind(country_rank,
                        data_ranked$rank[match(country_rank$country, data_ranked$country)])
}
# Rename columns
colnames(country_rank) <- c("country", years)
# Construct a numeric matrix from the dataframe (to compute the mean rank)
rank_matrix <- country_rank %>%
  select(-country) %>%
  as.matrix()
country_rank <- cbind(country_rank, average = rowMeans(rank_matrix))
country_rank[is.na(country_rank)] <- 0

# Print the rows of the dataframe with the highest mean rank (the bigger, the better)
country_rank %>%
  filter(average == max(average))
##            country 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007 average
## 1 Macao SAR, China  252  252  253  252  252  253  257  259  262  261   255.3
## 2           Monaco  253  253  252  253  253  252  256  258  261  262   255.3

So we conclude that the countries with the highest population density over the years are Macao SAR, China and Monaco.

Life expectancy increase

First, filter the data: select year, life expectancy, and country

df_life <- df %>%
  transmute(year=Year, country=Country.Name, life=Life.expectancy.at.birth..total..years.) %>%
  drop_na()

Look at the data.
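
The plotting chunk is not included in this export; a minimal sketch, drawing one line per country:

```r
# Life expectancy over time for every country
ggplot(df_life, aes(x = year, y = life, group = country)) +
  geom_line(alpha = 0.3) +
  labs(x = "Year", y = "Life expectancy at birth (years)")
```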

The plot is less than informative. We should pinpoint the greatest increase in life expectancy since 1962. In other words, the greatest difference between 2007 and 1962. Let’s compute it

df_life <- df_life %>%
  pivot_wider(names_from = year, values_from=life)

matrx <- df_life %>%
  select(-country)

res <- apply(matrx, 1, function(x){
  lst <- x[!is.na(x)]
  diffs <- lst[length(lst)]-lst[1]
  diffs
})

df_life$diff <- res
top_diffs <- filter(df_life, diff %in% sort(df_life$diff, decreasing = T)[1:10])[c(1,2,11,12)]
top_diffs
## # A tibble: 10 x 4
##    country            `1962` `2007`  diff
##    <chr>               <dbl>  <dbl> <dbl>
##  1 Bhutan               33.1   66.3  33.2
##  2 China                44.4   74.3  29.9
##  3 Iran, Islamic Rep.   46.1   72.7  26.6
##  4 Maldives             38.5   75.4  36.9
##  5 Nepal                36.0   66.6  30.6
##  6 Oman                 44.3   75.1  30.8
##  7 Saudi Arabia         46.7   73.3  26.7
##  8 Timor-Leste          34.7   65.8  31.1
##  9 Tunisia              43.3   74.2  30.9
## 10 Yemen, Rep.          34.7   62.0  27.2

Let's make the plot one more time, but this time color the top countries
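
The chunk is again missing. Since df_life has been pivoted to wide format by this point, one way to do it is to re-derive the long-format data from df and flag membership in top_diffs (the df_life_long name below is introduced here for illustration):

```r
# Re-derive the long-format data and highlight the top-10 countries
df_life_long <- df %>%
  transmute(year = Year, country = Country.Name,
            life = Life.expectancy.at.birth..total..years.) %>%
  drop_na() %>%
  mutate(top = country %in% top_diffs$country)

ggplot(df_life_long, aes(x = year, y = life, group = country, color = top)) +
  geom_line(alpha = 0.5) +
  labs(x = "Year", y = "Life expectancy at birth (years)",
       color = "Top-10 increase")
```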